Pre-processing and EDA of Vaccinations data (India):

  • Vaccination data is available from 2021-01-15 to 2021-07-14
  • The data for the pre-processing and EDA steps below was gathered from https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/India.csv
  • Below is the initial set of insights from this dataset
  • The dataset contains vaccination details specific to India
  • Details about variables/columns:
    • location: Name of the country the vaccination data refers to
    • date: Date of the observation
    • vaccine: List of vaccines administered in the country as of the date shown
    • source_url: URL of the source for the record
    • total_vaccinations: Total number of doses administered. If a person receives one dose of the vaccine, this metric goes up by 1. If they receive a second dose, it goes up by 1 again. Under a 2-dose protocol this equals people_vaccinated + people_fully_vaccinated.
    • people_vaccinated: (at least one dose) Total number of people who received at least one vaccine dose. If a person receives the first dose of a 2-dose vaccine, this metric goes up by 1. If they receive the second dose, the metric stays the same.
    • people_fully_vaccinated: (all prescribed doses) Total number of people who received all doses prescribed by the vaccination protocol. If a person receives the first dose of a 2-dose vaccine, this metric stays the same. If they receive the second dose, the metric goes up by 1.
  • These are counts for the entire country, reported on a running daily basis over the date range above.
  • Other state-wise COVID-19 data for India is available at https://www.kaggle.com/imdevskp/covid19-corona-virus-india-dataset?select=complete.csv but only from 2020-01-30 to 2020-08-06
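
The counting rules for the three vaccination columns can be illustrated with a small sketch. It is not taken from the dataset's own tooling: the function name tally, the constant DOSES_REQUIRED, and the person ids are hypothetical, and a 2-dose protocol is assumed.

```python
from collections import Counter

DOSES_REQUIRED = 2  # assumption: a 2-dose vaccination protocol


def tally(dose_events):
    """Compute the three metrics from a list of person ids, one entry per dose given."""
    doses_per_person = Counter(dose_events)
    # Every dose counts once.
    total_vaccinations = sum(doses_per_person.values())
    # Every person with at least one dose counts once.
    people_vaccinated = len(doses_per_person)
    # Only people who received all prescribed doses count.
    people_fully_vaccinated = sum(
        1 for n in doses_per_person.values() if n >= DOSES_REQUIRED
    )
    return total_vaccinations, people_vaccinated, people_fully_vaccinated


# Hypothetical events: "a" and "b" get a first dose, then "a" gets a second dose.
total, at_least_one, fully = tally(["a", "b", "a"])
print(total, at_least_one, fully)  # 3 2 1
```

In this toy case total_vaccinations (3) equals people_vaccinated (2) plus people_fully_vaccinated (1), which is the relationship noted for the total_vaccinations column.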

To pre-process and explore datasets (EDA):

  • The steps below examine the data fields, their content, and their data types
  • Finding the shape of the dataset, descriptive statistics, and metadata about the DataFrame
  • Imputing missing values, if any
  • Transforming columns based on categorical or date fields. For example, pandas reads a date field as object type, which must be converted to datetime type, especially for time-series analysis; categorical values are encoded based on the weightage of the variables for model fitting
  • Making the naming conventions of variables/fields consistent where they are not
  • Splitting the dataset into training and test sets
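
The steps above could be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the actual pipeline: the rows in SAMPLE_CSV are made up and only mimic the column layout, forward fill is one possible imputation choice, plain integer category codes stand in for the weightage-based encoding, and the 80/20 chronological split is an arbitrary ratio.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout; these are not real figures.
SAMPLE_CSV = """location,date,vaccine,source_url,Total_Vaccinations,people_vaccinated,people_fully_vaccinated
India,2021-03-01,"Covaxin, Oxford/AstraZeneca",https://example.org/a,100,80,20
India,2021-03-02,"Covaxin, Oxford/AstraZeneca",https://example.org/b,150,110,40
India,2021-03-03,"Covaxin, Oxford/AstraZeneca",https://example.org/c,,,
India,2021-03-04,"Covaxin, Oxford/AstraZeneca",https://example.org/d,260,190,70
India,2021-03-05,"Covaxin, Oxford/AstraZeneca",https://example.org/e,300,210,90
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Shape, data types, and descriptive statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Consistent naming: lower-case every column name.
df.columns = df.columns.str.lower()

# Dates are read as plain strings; convert them for time-series analysis.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").reset_index(drop=True)

# Impute gaps. Forward fill is an assumption that suits running totals.
count_cols = ["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]
df[count_cols] = df[count_cols].ffill()

# Encode the categorical column as integer codes (a simple stand-in encoding).
df["vaccine_code"] = df["vaccine"].astype("category").cat.codes

# Chronological split without shuffling, so the test rows follow the training rows.
split_at = int(len(df) * 0.8)
train, test = df.iloc[:split_at], df.iloc[split_at:]
print(len(train), len(test))
```

For time-series data a chronological split is usually preferred over a random one, since shuffling would let later observations leak into the training set.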